Introduction

This project began as an extension of a DataCamp course on business analytics using Microsoft Excel. The course used an Excel workbook with global sales and client data for athletic wear and taught how to create basic graphs and tables in Excel. After completing the course, I challenged myself to recreate those graphs and tables in RStudio, my analytics tool of choice.

After completing that challenge, I went a step further, creating interactive graphs and global heat maps of sales to offer even more insight to a potential colleague, boss, or investor. Below are the various graphs and tables, with ample descriptions of what I created and why I thought it was important.

Data

The data used for the following analytics comes from a DataCamp course on Microsoft Excel, which can be found here. I completed the first chapter of the course (free to complete), which walks through the data cleaning, sheet renaming, and sheet reorganization that are necessary before following along with my analytics below. After completing the chapter, I loaded the “Orders” sheet from the DataCamp workbook to conduct my analytics.

# load in pacman
library(pacman)

# load in necessary packages
p_load(readxl, janitor, ggplot2, plotly, ggthemes, dplyr)

# load in orders sheet from DataCamp Workbook
sales_data <- read_excel("~/github_repos/projects-and-work-samples/athletic_wear_sales_analytics_project_files/data/sales_data.xlsx",
                         sheet = "Orders")
sales_data
## # A tibble: 1,269 × 24
##    Order I…¹ Order…² `Order Date`        Days …³ Late_…⁴ Shipp…⁵ Categ…⁶ Custo…⁷
##        <dbl>   <dbl> <dttm>                <dbl>   <dbl> <chr>     <dbl>   <dbl>
##  1      6176    2015 2015-01-04 00:00:00       4       0 Standa…      17    3329
##  2     10384    2015 2015-01-06 00:00:00       4       1 Standa…      17     587
##  3     14551    2015 2015-01-08 00:00:00       2       0 Second…      17    2028
##  4     14574    2015 2015-01-08 00:00:00       2       1 Second…      18    6594
##  5     22924    2015 2015-01-12 00:00:00       2       1 Second…      29    9704
##  6       906    2015 2015-01-14 00:00:00       4       0 Standa…      40    7141
##  7       973    2015 2015-01-15 00:00:00       4       0 Standa…      29    5118
##  8      1077    2015 2015-01-16 00:00:00       2       1 Second…       9    8103
##  9      1105    2015 2015-01-17 00:00:00       4       1 Standa…      37    9760
## 10      1186    2015 2015-01-18 00:00:00       4       0 Standa…       9   11947
## # … with 1,259 more rows, 16 more variables: `Department Id` <dbl>,
## #   `Department Name` <chr>, Market <chr>, `Order City` <chr>,
## #   `Order State` <chr>, `Order Zipcode` <dbl>, `Order Country` <chr>,
## #   `Order Region` <chr>, `Product Category Id` <dbl>, `Product Id` <dbl>,
## #   `Product Price` <dbl>, `Product Cost` <dbl>, `Order Quantity` <dbl>,
## #   `Order Total Discount` <dbl>, Sales <dbl>, `Payment Type` <chr>, and
## #   abbreviated variable names ¹​`Order Id`, ²​`Order Year`, …

Cleaning column headers

The data above contains all sorts of categorical and numerical data regarding individual orders of athletic wear. Lots of potential for analytics! Next, I will clean the data a bit, focusing on the column titles and making them more R-friendly (meaning lowercase letters and underscores instead of spaces). I will use the janitor package to quickly clean the column names with its clean_names() function.

# clean column names
sales_data <- clean_names(sales_data)

# view data
sales_data
## # A tibble: 1,269 × 24
##    order_id order_…¹ order_date          days_…² late_…³ shipp…⁴ categ…⁵ custo…⁶
##       <dbl>    <dbl> <dttm>                <dbl>   <dbl> <chr>     <dbl>   <dbl>
##  1     6176     2015 2015-01-04 00:00:00       4       0 Standa…      17    3329
##  2    10384     2015 2015-01-06 00:00:00       4       1 Standa…      17     587
##  3    14551     2015 2015-01-08 00:00:00       2       0 Second…      17    2028
##  4    14574     2015 2015-01-08 00:00:00       2       1 Second…      18    6594
##  5    22924     2015 2015-01-12 00:00:00       2       1 Second…      29    9704
##  6      906     2015 2015-01-14 00:00:00       4       0 Standa…      40    7141
##  7      973     2015 2015-01-15 00:00:00       4       0 Standa…      29    5118
##  8     1077     2015 2015-01-16 00:00:00       2       1 Second…       9    8103
##  9     1105     2015 2015-01-17 00:00:00       4       1 Standa…      37    9760
## 10     1186     2015 2015-01-18 00:00:00       4       0 Standa…       9   11947
## # … with 1,259 more rows, 16 more variables: department_id <dbl>,
## #   department_name <chr>, market <chr>, order_city <chr>, order_state <chr>,
## #   order_zipcode <dbl>, order_country <chr>, order_region <chr>,
## #   product_category_id <dbl>, product_id <dbl>, product_price <dbl>,
## #   product_cost <dbl>, order_quantity <dbl>, order_total_discount <dbl>,
## #   sales <dbl>, payment_type <chr>, and abbreviated variable names
## #   ¹​order_year, ²​days_for_shipment_scheduled, ³​late_delivery_risk, …

Now the column headers are all lowercase and have underscores instead of spaces!
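As a quick illustration of what clean_names() does, here is a toy tibble (not part of the project data) before and after cleaning:

```r
# toy tibble with Excel-style column headers
library(janitor)
library(tibble)

toy <- tibble(`Order Id` = 1, `Order Year` = 2015, `Product Price` = 9.99)

# clean_names() lowercases names and replaces spaces with underscores
names(clean_names(toy))
# [1] "order_id"      "order_year"    "product_price"
```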

Graphs

Africa total sales over time

Regular graph

Now that the data is clean and easier to work with, I want to see how total annual sales have changed in certain markets. First I will look at the sales in Africa.

sales_data %>% 
  # filter for only observations in Africa
  filter(market=="Africa") %>% 
  # group by order year so we can plot one point for each year
  group_by(order_year) %>% 
  # create a sum of the 'sales' column after grouping by each year to obtain the total sales in each year
  mutate(total_annual_sales = sum(sales)) %>% 
  # create graph
  ggplot(mapping = aes(x = order_year, 
                       y = total_annual_sales)) +
  geom_point(color = "firebrick", size = 3) +
  geom_line(color = "firebrick", linewidth = 1.5) +
  # use 'ggthemes' package to add a professional looking theme to the graph
  theme_stata() +
  # add labels and title
  labs(title = "Annual Sales in Africa",
       x = "Year",
       y = "Total Annual Sales (USD)")

Looks like Africa only has two years' worth of data, 2016 and 2017, and it’s clear that sales declined over that period. I can’t tell from this graph what the exact sales values are, but I can venture an educated guess. If I want to interact with the graph, I can wrap the ggplotly() function from the plotly package around my ggplot() call from above to see exactly what the values are for each of the two data points.

Interactive graph

ggplotly(sales_data %>% 
  # filter for only observations in Africa
  filter(market=="Africa") %>% 
  # group by order year so we can plot one point for each year
  group_by(order_year) %>% 
  # create a sum of the 'sales' column after grouping by each year to obtain the total sales in each year
  mutate(total_annual_sales = sum(sales)) %>% 
  # create graph
  ggplot(mapping = aes(x = order_year, 
                       y = total_annual_sales)) +
  geom_point(color = "firebrick", size = 3) +
  geom_line(color = "firebrick", linewidth = 1.5) +
  # use 'ggthemes' package to add a professional looking theme to the graph
  theme_stata() +
  # add labels and title
  labs(title = "Annual Sales in Africa",
       x = "Year",
       y = "Total Annual Sales (USD)"))

The graph looks exactly the same, but if you hover over the data points you can see the actual total sales amount for 2016 and 2017. Now I see that sales fell from $40,484.64 in 2016 to $6,477.34 in 2017.
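If you prefer the exact figures in the console rather than on hover, the same totals can be printed directly (a quick sketch reusing the cleaned sales_data from above):

```r
# exact annual sales totals for Africa
sales_data %>% 
  filter(market == "Africa") %>% 
  group_by(order_year) %>% 
  summarise(total_annual_sales = sum(sales))
# should match the hover values: $40,484.64 (2016) and $6,477.34 (2017)
```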

Europe total sales over time

Regular graph

Next, I want to look at the annual sales in Europe. First, I’ll start with the standard ggplot() graph.

sales_data %>% 
  # filter for European observations
  filter(market == "Europe") %>% 
  # group by 'order_year' to calculate total sales per each year
  group_by(order_year) %>% 
  # Create total_sales value that is sum of sales in each year
  mutate(total_annual_sales = sum(sales)) %>% 
  # create graph
  ggplot(aes(x = order_year,
             y = total_annual_sales)) +
  geom_point(color = "steelblue3",
             size = 3) +
  geom_line(color = "steelblue3",
            linewidth = 1.5) +
  # add theme using 'ggthemes' package
  theme_stata() +
  # add labels
  labs(title = "Annual Sales in Europe",
       x = "Year",
       y = "Total Annual Sales")

Europe has one more year's worth of data than Africa, covering 2015-2017, and shows more volatility over that period. As I did with the Africa graph, I will use the ggplotly() function here to observe the sales values for each year.

Interactive graph

ggplotly(sales_data %>% 
  # filter for European observations
  filter(market == "Europe") %>% 
  # group by 'order_year' to calculate total sales per each year
  group_by(order_year) %>% 
  # Create total_sales value that is sum of sales in each year
  mutate(total_annual_sales = sum(sales)) %>% 
  # create graph
  ggplot(aes(x = order_year,
             y = total_annual_sales)) +
  geom_point(color = "steelblue3",
             size = 3) +
  geom_line(color = "steelblue3",
            linewidth = 1.5) +
  # add theme using 'ggthemes' package
  theme_stata() +
  # add labels
  labs(title = "Annual Sales in Europe",
       x = "Year",
       y = "Total Annual Sales"))

With the interactive graph above, I see that sales in Europe totaled $21,717.55 in 2015, fell to $2,964.55 in 2016, and finally rose to $36,172.24 in 2017.

Department sales

Bar graph of sales values by department

Next, I want to graph all of the individual sales in each department to get a better idea of which departments can generate the most revenue for this company per order. I will create a bar graph that visualizes this data, opting this time to show only the interactive plot so I can observe the exact sales value of each order from each department.

# wrap in ggplotly() to make interactive
ggplotly(sales_data %>% 
# use reorder() in aes() to put the bars in descending order (-sales means descending order by the sales variable)
           ggplot(aes(reorder(department_name, -sales),
                      sales)) +
# stat = "identity" maps the y axis to the sales variable instead of the default count
           geom_bar(stat = "identity", fill = "skyblue3", position = 'dodge') +
  theme_stata() +
  labs(title = "Revenue values by Department",
         y = "Sales (USD)", 
         x = "Department")) 

This graph offers a lot in terms of actionable insights. By hovering over each bar, I can see the dollar value of every order from that particular department. Clearly, the technology department can offer the most in terms of revenue, as it has the most expensive orders in all the data, with fitness a distant second. Before making a decision on revenue, however, I will next create a bar graph detailing the total sales per department and see whether the results match this graph.
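To back up that eyeball comparison, the single most expensive order in each department can be pulled directly (a sketch reusing the cleaned sales_data; output omitted since it depends on the workbook):

```r
# largest single-order sale per department, most expensive first
sales_data %>% 
  group_by(department_name) %>% 
  summarise(max_order_sale = max(sales)) %>% 
  arrange(desc(max_order_sale))
```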

Bar graph of total sales by department

To tell which department is responsible for the most sales, or revenue, I first need to calculate the total sales by department, then create a graph similar to the ones above using the newly created total_department_sales value instead of the raw sales value.

# create new column for total sales grouped by department
sales_data <- sales_data %>% 
  group_by(department_name) %>%
  mutate(total_department_sales = sum(sales)) %>% 
  # drop the grouping so later operations start from ungrouped data
  ungroup()

# create graph
ggplotly(
  ggplot(sales_data,
         aes(department_name,total_department_sales)) +
    geom_bar(stat = 'identity',
               position = 'dodge',
               fill = 'palevioletred') +
    theme_stata() +
    labs(title = "Total Revenue by Department",
         y = "Total Sales (USD)", 
         x = "Department")
  )

Now this graph tells a very different story from the previous one. In the previous graph, we saw that the technology department had the highest order values of any department, but this new graph shows that technology accounts for very little in terms of total revenue. Apparel is by far bringing in the most revenue, with Fitness, Footwear, and Golf following and bringing in similar amounts. So, even though the technology department has some of the most expensive orders, those are only a handful of orders that, compared with the massive quantity of apparel orders, are dwarfed in terms of the revenue generated.
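A quick count of orders per department confirms the volume explanation (a sketch; the ungroup() call guards against any leftover grouping from earlier steps, and the output is omitted since it depends on the workbook):

```r
# number of orders placed in each department, largest first
sales_data %>% 
  ungroup() %>% 
  count(department_name, sort = TRUE)
```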

Department sales in each market

Now that we have seen the individual and total sales values across departments, let’s add another element, the market, to see how departmental sales vary geographically. This will show where each department generates the most revenue.

# wrap in ggplotly() to create interactive graph
ggplotly(
  sales_data %>% 
    # group by market
    group_by(market, department_name) %>% 
    # create total sales value by summing sales for each market-department pair
    summarise(total_market_sales = sum(sales), .groups = "drop") %>% 
    # graph total sales by market
    ggplot(aes(x = market, y = total_market_sales, fill = department_name)) +
    geom_bar(stat = "identity",
             position = "dodge") +
    labs(title = "Total Department Sales by Market",
         y = "Total Sales (USD)",
         x = "Market",
         fill = "Department Name") +
    scale_fill_colorblind()
  )
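The same market-by-department totals also read well as a table. A pivot_wider() call from the tidyr package (assuming it is installed) reshapes the summarised data into one row per market:

```r
library(tidyr)

# one row per market, one column of total sales per department
sales_data %>% 
  group_by(market, department_name) %>% 
  summarise(total_market_sales = sum(sales), .groups = "drop") %>% 
  pivot_wider(names_from = department_name, values_from = total_market_sales)
```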